A Three-Stage Approach for Identifying Gender Differences on Large-Scale Science Assessments

Authors

  • Rebecca J. Gokiert
  • Jacqueline P. Leighton
Abstract

Recent research into students' performance on large-scale tests of academic achievement has revealed that some tests of achievement, including those in science, are multidimensional (e.g., Ayala, Shavelson, Yin, & Shultz, 2002; Hamilton, Nussbaum, Kupermintz, Kerkhoven, & Snow, 1995; Leighton, Gokiert, & Cui, 2005; Nussbaum, Hamilton, & Snow, 1997). According to the Standards for Educational and Psychological Testing, group differences can be attributed to the existence of multiple dimensions on a test (AERA, APA, & NCME, 1999). The following research employs a large-scale science assessment to illustrate the utility of a three-stage approach for investigating gender differences in science achievement. The three stages include analysis of dimensionality, differential item functioning (DIF), and think-aloud interviews. Preliminary results indicate that one of the tests examined displayed multidimensionality and that two dimensions best describe the test. Furthermore, systematic gender differences were found within each of the two dimensions. Preliminary themes extracted from interview data, collected from grade 8 students solving 12 items that displayed large DIF, provide some understanding of why the DIF is occurring and whether it is due to bias or impact.

A Three-Stage Approach for Identifying Gender Differences on Large-Scale Science Assessments

Large-scale assessment has become a national and international method for monitoring student achievement and for ensuring that educational systems are working (Alberta Education, 2005; Hamilton, Stecher, & Klein, 2002; McGehee & Griffith, 2001). As society and governments place more emphasis on large-scale testing, it has become increasingly important to examine the quality of large-scale testing programs and, in particular, the validity of inferences drawn from large-scale tests. Validity, as defined by the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999, p. 9), is "the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests." Recognizing the importance of test validation, Haladyna and Downing (2004) argue that defining the construct, that is, explicitly specifying the latent trait(s) or knowledge and skills being measured by the test, is the first step toward appropriate test score interpretation. Construct validation is the process through which the suitability of interpreting test scores is examined in the context of the latent trait(s) or knowledge and skills measured by the test. Developers of large-scale assessment tools often fail to verify the types of skills that are measured by tests and fail to provide guidelines for how strengths and weaknesses in student performance should be interpreted (Messick, 1994; NRC, 2001). When developers of large-scale assessments fail to explicitly state and incorporate the skills that are measured by these tests, test score interpretation becomes problematic. In order to address concerns surrounding ill-defined constructs, researchers have begun to investigate empirically the underlying knowledge and skills measured by tests (NRC, 2001).
In the domain of science, achievement is often characterized by a number of skills, such as quantitative reasoning, scientific reasoning, and spatial-mechanical reasoning (e.g., Hamilton, Nussbaum, Kupermintz, Kerkhoven, & Snow, 1995; Nussbaum, Hamilton, & Snow, 1997). If these dimensions represent distinct knowledge, skills, and attributes in scientific achievement, it is important that tests capture these domains and that test scores reflect an individual's performance across these areas. Ensuring that the subject domain the test is intended to measure is in fact measured will yield student scores that can be validly interpreted in terms of the student's strengths and weaknesses in the different areas of science achievement. The purpose of this study, therefore, was to examine the utility of three approaches (dimensionality analysis, differential item functioning, and protocol analysis) in collecting evidence about the underlying knowledge and skills measured by the School Achievement Indicators Program (SAIP) Science Assessment administered in 2004 (Council of Ministers of Education, Canada [CMEC], 2000). These approaches were used to (1) determine the dimensional structure of the 2004 administration of the SAIP, (2) determine if the performance of male and female students differs systematically on the SAIP items, and (3) determine if interview data of male and female students can aid in the generation of hypotheses about the underlying knowledge and skills measured by the test.

Dimensionality

Much of the research on the construct validation of large-scale science assessments has focused on test dimensionality (Hamilton et al., 1995; Nussbaum et al., 1997). Test dimensionality is defined as the smallest number of "dimensions or statistical abilities required to fully describe all test-related differences among the examinees in the population" (Tate, 2002, p. 184). Knowledge of the latent dimensional structure can also provide more meaningful information about test scores, and can ultimately enhance the validity of the inferences made from the test scores (Ayala, Shavelson, Yin, & Shultz, 2002; Childs & Oppler, 2000; Frenette & Bertrand, 2000; Hamilton et al., 1995; Nussbaum et al., 1997). Dimensionality research can help answer questions about how many latent traits are measured by a test overall and whether reporting student performance with a single score is reasonable given the number of latent traits found to underlie the test. Research into the complex nature of students' cognitive skills and their interaction with measures of achievement has revealed that some tests of achievement, specifically in science, are multidimensional (e.g., Ayala et al., 2002; Hamilton et al., 1995; Leighton, Gokiert, & Cui, 2005; Nussbaum et al., 1997). Richard E. Snow and his colleagues established the multidimensional nature of science achievement using the NELS:88 and later a compilation of NELS:88, TIMSS, and NAEP items (Ayala et al., 2002; Hamilton et al., 1995; Nussbaum et al., 1997). The dimensional structures that emerged after subjecting the NELS:88 science test samples for grades 8 and 10 to full-information factor analysis comprised four and three factors, respectively. The underlying knowledge and skills measured by the NELS:88 included dimensions such as spatial-mechanical reasoning, basic knowledge and reasoning, chemistry knowledge, everyday science knowledge, and reasoning with knowledge.
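To make the dimensionality question concrete, the sketch below illustrates one simple, data-driven check: parallel analysis of the inter-item correlation matrix. This is not the procedure used in the studies cited above, which relied on full-information factor analysis and nonparametric techniques; the data are simulated rather than actual NELS:88 or SAIP responses, Pearson correlations stand in for the tetrachoric correlations that would ordinarily be preferred for dichotomous items, and all names and numbers are hypothetical.

```python
# Illustrative sketch only: an eigenvalue-based check of test dimensionality
# (parallel analysis on the inter-item correlation matrix). Simulated data,
# not SAIP or NELS:88 responses.
import numpy as np

rng = np.random.default_rng(0)

# Simulate dichotomous (0/1) responses for 2000 examinees on 30 items driven by
# two latent traits (hypothetical numbers chosen only for this example).
n_examinees, n_items = 2000, 30
theta = rng.normal(size=(n_examinees, 2))                  # two latent dimensions
loadings = np.zeros((n_items, 2))
loadings[:15, 0] = 1.0                                     # items 1-15 load on trait 1
loadings[15:, 1] = 1.0                                     # items 16-30 load on trait 2
logits = theta @ loadings.T + rng.normal(scale=0.5, size=(n_examinees, n_items))
responses = (logits > 0).astype(int)

# Eigenvalues of the observed inter-item correlation matrix, largest first.
obs_eigs = np.sort(np.linalg.eigvalsh(np.corrcoef(responses, rowvar=False)))[::-1]

# Parallel analysis: compare against eigenvalues from random data of the same size.
n_reps = 50
rand_eigs = np.empty((n_reps, n_items))
for r in range(n_reps):
    rand = rng.integers(0, 2, size=(n_examinees, n_items))
    rand_eigs[r] = np.sort(np.linalg.eigvalsh(np.corrcoef(rand, rowvar=False)))[::-1]
threshold = np.percentile(rand_eigs, 95, axis=0)

# Retain only dimensions whose eigenvalues exceed what chance alone produces.
n_dimensions = int(np.sum(obs_eigs > threshold))
print(f"Dimensions suggested by parallel analysis: {n_dimensions}")
```

Run on these simulated two-trait responses, the check recovers two dominant dimensions; applied to real test data, the same logic helps decide whether a single reported score is defensible.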
In a study examining the dimensional structure of the SAIP 1999 Science achievement test, both traditional and contemporary tests of dimensionality were used (Leighton et al., 2005). Results, using both factor analytic and nonparametric techniques, indicated that the SAIP science assessment is multidimensional. For the grade 8 and grade 11 samples, between two and four factors were found to underlie the data. Although the majority of dimensionality studies examining large-scale science assessments are exploratory in nature, results from exploratory analyses can be used as a data-driven method, both to investigate whether a science assessment is measuring a multidimensional construct and to guide the development of hypotheses about scientific reasoning. When conducting confirmatory analyses, test specifications can act as a springboard for examining the dimensional structure of science content and skills. However, test specifications do not capture subtle psychological processes and, therefore, may not fit the data well in a confirmatory paradigm (Leighton et al., 2005; Norris, Leighton, & Phillips, 2004). That test specifications fail in some cases to fit the data in the form of student responses is not surprising, given that test specifications do not necessarily represent the cognitive processes students use to respond to test items (Norris et al., 2004). As a result, there has been a push towards the use of cognitive models to better guide the development of large-scale achievement assessments (Embretson, 1999; Haladyna & Downing, 2004; NRC, 2001; Norris et al., 2004; Snow & Lohman, 1989). The National Research Council (2001) suggests that more meaningful inferences could be made about student knowledge and skills if they were tied to explicit theories of cognition and learning. Theories of scientific reasoning exist; however, these theories are conceptual in nature, have not been used in test development, and are rarely applied to real test data. The majority of dimensionality studies examine test data after the test has been administered, and attempts to match test score interpretation to existing theories of scientific reasoning occur after the data have been collected (Leighton, Gierl, & Hunka, 2004; Leighton et al., 2005; NRC, 2001). Typically, a theory would be expected to guide test development and then be used to interpret test scores. However, tests are often designed without a theoretical model in mind (Lane, 2004; Leighton et al., 2005). Retrofitting data to existing theories of scientific reasoning, although well intentioned, may result, at best, in hypotheses about test score interpretation.

Performance Differences

When attempting to describe the dimensional structure of a test, it is also important to identify and understand the dimensional structure of individual items. Roussos and Stout (1996) define an item dimension as "any substantive characteristic of an item that can affect the probability of a correct response on the item" (p. 356). When an item is found to measure multiple dimensions, this can result in differential item functioning (DIF) for groups of students. DIF occurs when two groups of examinees with equal ability, as indicated by observed test performance, do not have the same probability of answering the item correctly.
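One common way to operationalize this definition is the Mantel-Haenszel procedure, which stratifies examinees by total test score and compares the odds of a correct response for the two groups within each stratum. The minimal sketch below illustrates that idea in Python; it is not the DIF procedure used in this study, and the response data, group coding, and effect sizes are hypothetical and chosen only for illustration.

```python
# Minimal Mantel-Haenszel DIF sketch (illustrative only; not the procedure used in
# this study). Examinees are stratified by total test score, and the common odds
# ratio of a correct response (reference vs. focal group) is pooled across strata.
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """item: 0/1 responses to the studied item; total: total test scores;
    group: 0 = reference group, 1 = focal group (hypothetical coding)."""
    num, den = 0.0, 0.0
    for score in np.unique(total):
        stratum = total == score
        ref, foc = stratum & (group == 0), stratum & (group == 1)
        a = np.sum(item[ref] == 1)   # reference correct
        b = np.sum(item[ref] == 0)   # reference incorrect
        c = np.sum(item[foc] == 1)   # focal correct
        d = np.sum(item[foc] == 0)   # focal incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den                  # pooled (common) odds ratio
    delta_mh = -2.35 * np.log(alpha_mh)   # ETS delta scale; negative values favour the reference group
    return alpha_mh, delta_mh

# Hypothetical data: 1000 examinees with simulated total scores and item responses,
# with a built-in advantage for the reference group on this item.
rng = np.random.default_rng(1)
group = rng.integers(0, 2, size=1000)
total = rng.integers(0, 41, size=1000)                 # total score on a 40-item test
p_correct = 0.3 + 0.01 * total + 0.10 * (group == 0)
item = (rng.random(1000) < p_correct).astype(int)

alpha, delta = mantel_haenszel_dif(item, total, group)
print(f"MH common odds ratio = {alpha:.2f}, MH D-DIF (delta) = {delta:.2f}")
```

Because the two groups are matched on total score within each stratum, a pooled odds ratio far from 1 flags an item whose difficulty differs for examinees of comparable overall ability, which is precisely the pattern the definition above describes.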
It has been suggested that items that display DIF measure a primary dimension (the dimension the item is intended to measure) along with at least one secondary dimension that was not intended to be measured by the item (e.g., Messick, 1989; Roussos & Stout, 1996). The secondary dimension(s) measured by the item can be representative of the construct or irrelevant to the construct being measured. Construct-irrelevant variance has been described as a form of systematic error that can affect the probability of an examinee answering an item correctly. When a construct-irrelevant (or nuisance) dimension is present on a test, examinees with lower ability on the nuisance dimension will likely score lower on the test than other examinees who are of equal ability on the dimension of interest but who have higher ability on the nuisance dimension. Once items have been identified as displaying statistically significant DIF, the next step is to determine whether this difference is due to bias or impact. The Standards (1999) prescribe that tests must be free from bias to be considered fair. Willingham and Cole (1997) suggest that "fair test design should provide examinees comparable opportunity, as far as possible, to demonstrate knowledge and skills they have acquired that are relevant to the purpose of the test" (p. 10). Bias, in the context of assessment, occurs when items on a test systematically advantage or disadvantage one group over another even when the groups have the same ability. This bias may result in the inconsistent selection and classification of students, which can have serious consequences if the nature of the selection and classification is high stakes (Moss, 1998). Conversely, an item displays impact if the identified DIF is due to genuine knowledge or experience differences or both. There are many available procedures for identifying items that function differentially for subgroups; however, the capacity to determine whether the DIF is due to bias or impact is still under-developed (e.g., Camilli & Shepard, 1994; Gierl, Bisanz, Bisanz, Boughton, & Khaliq, 2001; Gierl, Rogers, & Klinger, 1999). DIF analyses are a routine part of large-scale assessment testing programs; however, studies that seek to understand the potential sources of DIF are less common (Gierl et al., 2001). Considering the adverse impact that multiple dimensions in test items can have on the measurement of the desired construct, the validity of inferences drawn about student performance needs to be systematically examined using multiple methods. Sources of DIF in large-scale assessments have been explored; these include differences in item format (multiple choice vs. open-ended/constructed response), gender, translation, culture, and background experience (e.g., Ercikan, Law, Arim, Domene, Lacroix, & Gagnon, 2004; Gierl et al., 1999; Henderson, 1999).

Gender differences in science assessment.

Gender differences in large-scale assessment are considered by many to be the most carefully examined aspect of test fairness (Ryan & DeMark, 2002). Maccoby and Jacklin's (1974) pioneering work on gender differences shaped the current research trend toward examining the accuracy of claims that males and females differ on verbal ability, quantitative ability, and spatial ability. Hedges and Nowell (1995) synthesized the results from several gender difference studies that used nationally representative samples on large-scale assessments.
Overall, their analyses suggested that gender differences are small for most areas of achievement, with the exception of writing achievement, science achievement, and stereotypically male-related occupations (Hedges & Nowell, 1995). Although some findings suggest that average gender differences in science are decreasing (Linn & Hyde, 1989), Hedges and Nowell (1995) found that across the 32-year period they examined, gender differences were relatively stable. Research on gender differences reveals trends in the content and skill areas in which males and females differ (Beller & Gafni, 1996; Halpern, 1997; Hamilton, 1998; Hedges & Nowell, 1995; Linn & Peterson, 1985). These trends are especially apparent in spatial ability items, along with physical science and earth and space science items, which reveal large male advantages. When considering the dimensional structure found previously in the NELS:88 (Ayala et al., 2002; Hamilton et al., 1995; Nussbaum et al., 1997), a large male advantage was found, and the difference was attributed to performance on the spatial-mechanical reasoning (SM) dimension, which consists primarily of items that could be classified as physical science items. Beller and Gafni (1996) analyzed the 1991 International Assessment of Educational Progress (IAEP) and found a significant male advantage on physical science and earth and space science items. A similar pattern of male advantage was found for fourth-grade students on the Third International Mathematics and Science Study (TIMSS); while males outperformed females on physical and earth science items, little difference was found between males and females for life and nature of science, or for environmental issues (Hamilton, 1998). Although gender differences are frequently found on spatial ability measures, it is unclear how spatial ability specifically relates to the construct of science achievement. Furthermore, it is unclear why this advantage exists. Some research suggests that males are more attracted to both extracurricular activities and courses that establish and enhance spatial abilities (Hamilton, 1998); however, this hypothesis has not been fully investigated or empirically tested.

Gender format differences.

The likelihood that male and female performance diverges across assessment formats (e.g., multiple choice [MC] and constructed response [CR]) has resulted in several studies examining this possible form of test bias (e.g., Klein, Jovanovic, Stecher, McCaffrey, Shavelson, Haertel, Solano-Flores, & Comfort, 1997; Resnick & Resnick, 1992). The general trends have indicated that males tend to perform better than females on MC tasks in science, whereas females perform better on CR tasks in science (Resnick & Resnick, 1992). The possibility that MC and CR tasks measure different cognitive skills may explain, in part, why males and females perform differently across these tasks. If MC and CR are measuring different aspects of achievement, an interaction between item format and gender might be expected. The literature reviewed below has attempted to illuminate some of the reasons for gender differences across item formats. A tentative explanation is that females experience performance advantages when scores depend on language usage, resulting in a female advantage on CR tasks (Henderson, 1999; Klein et al., 1997; Stumpf & Stanley, 1996).
Klein et al. (1997) demonstrated that females generally performed better than males on hands-on science tasks that required attention to detail and reading. On the other hand, males outperformed females on items that required inferences and prediction. A comprehensive review of gender and fair assessment conducted by Willingham and Cole (1997) led to the conclusion that although females tended to perform better on CR formats than on MC formats, this effect was not consistent, as many studies also demonstrated that females can perform well on MC items. It was also found that gender format differences in mathematics, language, and literature did not occur as frequently as gender format differences in science test items. Beller and Gafni (2000) suggested that the superior verbal abilities of women may be better illuminated in CR items. They further suggested that writing ability may also play a role in the differential performance of women on CR items. In addition, they suggested that males may perform better on MC items because they take more risks in responding (as evidenced by guessing). In an attempt to study gender format differences on the NELS:88 science test and generate hypotheses about why the DIF was occurring, Hamilton (1999) used complementary methodological approaches. Through the use of statistical DIF analyses and small-scale interviews, gender differences were found to occur on items with visualization requirements and items that required knowledge and skills obtained outside of the educational setting. If males possess stronger visualization skills and are more apt to use them in solving science items, this could explain why boys outperform girls in spatial reasoning. To fully appreciate how the multifaceted associations between format, content, and cognitive processes affect the performance of different groups of students, possible contributing item features need to be examined systematically (Hamilton, 1999). Small-scale interview studies offer one method that can shed light on the gender differences associated with test item performance (Hamilton, 1999; Ercikan et al., 2004).

Uncovering Higher-Level Thinking Skills

Interview data, such as think-aloud reports, can help yield hypotheses about gender differences and the underlying knowledge and skills measured by tests (Hamilton et al., 1997; Ercikan et al., 2004). Interest in student performance that goes beyond simple right and wrong response patterns has "increased the demand for data that trace cognitive processes" (Russo, Johnson, & Stephens, 1989). Think-aloud verbal protocols, in which students are asked to verbally report their thoughts as they work through specified tasks, have proven useful in examining the underlying cognitive skills that students employ in problem solving (Ercikan et al., 2004; Ericsson & Simon, 1993; Hamilton et al., 1997; Norris et al., 2004). Think-aloud methods offer one way to uncover the substantive nature of dimensions at both the test level and the item level (Hamilton et al., 1997; Leighton, 2004; NRC, 2001). The National Research Council (2001) suggests that the validity of inferences drawn from test performance can be improved when information is gathered about the specific knowledge and skills students actually use during test performance. The common approach to determining the knowledge and skills measured by tests is to consult with content experts, test developers, and psychometricians.
An inherent limitation of this approach is that content experts typically possess problem-solving skills that are very different from those of students. Therefore, the hypotheses they generate or inferences they make about student performance may be misinformed (Leighton, 2004; Norris et al., 2004). Protocol analysis offers an innovative way to support statistical investigations by allowing researchers to examine the actual scientific reasoning skills that students employ as they solve science tasks (Baxter & Glaser, 1998; Ercikan et al., 2004; Ericsson & Simon, 1993; Hamilton et al., 1997; Hamilton, 1998; Leighton, 2004; Norris et al., 2004). Baxter and Glaser (1998) suggested a theoretical approach for evaluating the construct being measured by examining how the relationships among comprehensive verbal protocols, observations of student performance, and scoring criteria are evidenced in science assessments. Hamilton et al. (1997) used a small-scale interview study to aid in the interpretation of factors from the NELS:88 science study. From this study, the researchers concluded that small-scale interviews could be used to enhance and support dimension interpretation in order to define the construct more clearly. Moreover, the interviews proved helpful in interpreting items that possessed inconsistent factor loadings. Other recent studies that used this method to investigate the latent traits measured by tests have yielded information about the constructs measured and potential explanations for student performance differences (e.g., Ercikan et al., 2004; Hamilton et al., 1997). If the goal is to make valid inferences about student performance, it is imperative to examine the underlying knowledge and skills students bring to bear on tests of achievement.
